Abstract

We propose a framework for dealing with 
binary hard-margin classification in Banach 
spaces, centering on the use of a supporting 
semi-inner-product (s.i.p.) taking the place 
of an inner-product in Hilbert spaces. The 
theory of semi-inner-product spaces allows 
for a geometric, Hilbert-like formulation of 
the problems, and we show that a surprising
number of results from the Euclidean case
can be appropriately generalised. These include
the Representer theorem, convexity of
the associated optimization programs, and 
even, for a particular class of Banach spaces, 
a “kernel trick” for non-linear classification. 


1 Introduction 

The theory of classical Support Vector Machines
having attained an enviable apogee of mathematical
completeness, empirical success and aesthetic
coherence, efforts in the Machine Learning community
have recently turned to possible extensions of its
basic framework. Perhaps the prime restriction in the
standard theory is the assumption that the training
data (or its features) lie in a Hilbert space. This
choice is natural as an initializing point: the
geometry of Hilbert spaces is well understood, and the
bilinearity of the inner product makes a thorough-going
analysis possible. The simplicity of its structure also
firmly marks its limitations: most data, for instance,
do not come with any natural notion of distance that
can be induced from an inner-product. Hilbert spaces
are also somewhat pedestrian: since all Euclidean
spaces of the same basis cardinality are isometrically
isomorphic [3], there is, in a sense, only one
inner-product space.

The search for generalizations is then a search for
algebraic/analytic structures more accommodating to
larger classes of data, and more representative of
complex distance relations. In the linear algebra
hierarchy, the normed spaces and their complete
cousins, the Banach spaces, inhabit the niche one place
removed from the Hilbert spaces. These are spaces where
lengths of vectors have been defined (and hence
distances), but not angles. Research on large-margin
classification in Banach spaces has already been
initiated [1, 13], and even more generally in metric
spaces [6, 10]; in an alternative direction [11]
considers classification in Krein spaces, i.e. spaces
with non-positive inner-products. Arguably, the Banach
space setting still occupies the center-stage of
scrutiny, for a number of reasons. First, vector space
structures come ready-equipped with a number of
convenient properties and objects; most importantly for
classification problems, that of linear operators and
functionals, and hence hyperplanes. Secondly, the
introduction of a norm (and more generally, metric)
allows the preservation of the notion of margin. These
two objects, hyperplane and margin, might be construed
as the minimal set of concepts required to construct a
geometric generalization of classification in Hilbert
spaces; if so assumed, the Banach space assumption then
becomes the natural choice of minimal structure
required to support these two notions. Finally,
classification problems in cases where the data do not
possess linear structure may still be attacked via
normed-space ideas. For example, every metric space can
be isometrically embedded into a Banach space: a number
of constructions exist [6, 10]. Procedures developed
for normed spaces may lead directly to algorithms for
classification in the more general metric spaces.

What can replace the inner-product in non-Hilbert
Banach spaces? G. Lumer in [9] introduced the notion of
semi-inner product (s.i.p.) spaces: normed vector
spaces with a type of inner-product satisfying many,
but not all, of the axioms of a Hilbert inner-product.
Crucially, every Banach space can be represented by a
(not necessarily unique) s.i.p., a form with sufficient
structure to carry over to the Banach space setting a
number of Hilbert-space-type arguments. Indeed, many
concepts seemingly unique only to Hilbert spaces find
counterparts in normed spaces, via the
semi-inner-product machinery. To give a sample
selection: the Riesz Representation theorem and duality
mappings, orthogonality relations, and generalizations
of special concepts for Hilbert operators such as
Hermiticity and numerical range.

This paper outlines a theory of large-margin binary
classification in Banach spaces, where the central
results are derived and couched in a semi-inner product
formalism. In particular, we focus on a certain
well-behaved class of Banach spaces: the uniformly
smooth and uniformly convex spaces. Roughly speaking,
these are structures possessing a type of Riesz
representation theorem and include, for example, the
$L^p$ spaces, $1 < p < \infty$. In such spaces, the
entire infrastructure for linear classification so
well-studied in Hilbert spaces “goes over”, more or
less with the s.i.p. replacing the inner product.
Indeed, we prove that the maximum-margin problem
becomes well-posed, we establish a finite-dimensional
linear Representer theorem, and show that the
coefficients of the classifier are obtained through a
convex (non-quadratic) optimization problem; remarkable
facts given that the supporting s.i.p.’s are not
bilinear in general.

For a special class of Banach spaces, $L^{2p}$, $p$ an
integer, a complete theory, extending to the case of
non-linear classifiers, can be developed, as a parallel
to kernel methods for Hilbert space classification.
Here, moment functions replace kernel functions to give
a version of the kernel trick. The available types of
dependency relations become significantly broadened in
this theory, utilising as it does $2p$-th order
statistics instead of the second-order statistics of
the inner-product.

We begin with a primer on s.i.p. spaces, collecting a
number of results culled from the mathematical
literature. Section 3 discusses the application of
these results to classification in Banach space, by
deriving, from the geometric s.i.p. point of view, two
formulations of hyperplane classification, one an
optimization over the learning domain, and the other in
the dual of continuous linear functionals. The s.i.p.
machinery shows that the optimization in the dual space
can always be made a simple convex problem with affine
constraints. A Representer theorem is then proved,
demonstrating that it suffices to consider only linear
functionals on the space spanned by the data; this
result allows us to formulate a finite-dimensional
convex program for the hyperplane coefficients using
standard ideas from Lagrange optimization. Finally, the
latter sections discuss how to obtain non-linear
classifiers for the case $L^{2p}$, via moment
functions, in an analogous generalization of Hilbert
SVM theory.


2 Semi-Inner-Product Spaces 

Let us collect a number of important results on
semi-inner-products useful in the sequel. For
simplicity of discourse, and with a view toward
applications, we have not always provided the most
general conditions for each statement: for optimal
assumptions the references may be consulted.

Definition 1. Let $(X, \|\cdot\|)$ be a real Banach
space. A semi-inner-product (s.i.p.) on $X$ is a real
function $\langle x, y\rangle$ on $X \times X$ with the
properties$^1$

1. (Linearity in second argument)
   $\langle x, y_1 + y_2\rangle = \langle x, y_1\rangle + \langle x, y_2\rangle$

2. (Homogeneity)
   $\langle \alpha x, y\rangle = \langle x, \alpha y\rangle = \alpha\langle x, y\rangle$

3. (Norm-inducing) $\langle x, x\rangle = \|x\|^2$

4. (Cauchy-Schwarz) $\langle x, y\rangle \le \|x\|\,\|y\|$


Semi-inner-products are not usually linear in their
first argument, nor symmetric, unless the space is
Hilbertian, in which case the s.i.p. coincides with the
inner product. The Hahn-Banach theorem gives the
existence of a s.i.p. for every Banach space, without
providing any explicit description for the possible
supporting s.i.p.’s, nor conditions under which the
s.i.p. is unique. The special case of smooth Banach
spaces (i.e. where the norm is Gâteaux differentiable)
suffices to ensure uniqueness of the representation, as
well as an explicit form for the s.i.p. in terms of the
norm:

Theorem 1. [4] A Banach space $X$ has unique s.i.p.
if and only if it is smooth, in which case

$$\langle x, y\rangle = \lim_{\lambda \to 0} \frac{\|x + \lambda y\|^2 - \|x\|^2}{2\lambda} \qquad (1)$$

The above result is highly apposite for the calculation 
of s.i.p.’s, and shows that the semi-inner products are 
essentially directional derivatives of the square norm. 

It will be desirable to consider classes of Banach
spaces not only with differentiable norm, but which
satisfy the following uniform convexity property: for
each $\epsilon > 0$, there exists $\delta > 0$ such
that $\|x + y\|/2 \le 1 - \delta$ whenever
$\|x - y\| \ge \epsilon$, for all $x, y$ in the unit
ball. Such spaces have several important
characteristics; they are reflexive, and the infimum
distance between a closed convex set $C$ and a given
point $x_0$ is actually achieved by some vector
$c \in C$ [8]. We shall see that the uniform convexity
assumption guarantees the existence and uniqueness of a
maximum-margin hyperplane solution.

When a Banach space is both uniformly smooth and
uniformly convex, one obtains a set of satisfying
Hilbert-like duality properties:

Theorem 2. Let $X$ be a uniformly smooth and uniformly
convex Banach space with s.i.p.
$\langle\cdot,\cdot\rangle$ and dual $X^*$. Then:

i) [3] (General Riesz Representation) For each
continuous linear functional $f \in X^*$, there exists
a unique vector $w \in X$ such that
$f(x) = \langle w, x\rangle$.

ii) [2] The dual $X^*$ is a uniformly smooth and
uniformly convex Banach space supported by the
semi-inner-product defined by
$\langle f_{w_1}, f_{w_2}\rangle = \langle w_2, w_1\rangle$,
where $f_{w_i}$ is the linear functional associated
with $w_i \in X$.

Remark: We stress the alternation of positions of
variables in the dual s.i.p.

Examples: All of the foregoing theorems are
instructively illuminated in the following concrete
situations.

1. $X = L^p(\Omega, \mu)$. For $1 < p < \infty$, these
Banach spaces are readily confirmed to be uniformly
smooth and uniformly convex; this is not so for $p = 1$
or $p = \infty$. Let
$\psi_p : L^p(\Omega, \mu) \to L^q(\Omega, \mu)$ be
defined$^2$ by
$\psi_p(x) = x^{\langle p-1\rangle} / \|x\|_p^{p-2}$,
$\psi_p(0) = 0$ through continuity, and
$\frac{1}{p} + \frac{1}{q} = 1$. Then $\psi_p$ is a
norm-preserving ($\|x\|_p = \|\psi_p(x)\|_q$)
bijection, with inverse $\psi_q$, and
$\langle x, y\rangle_p = \int_\Omega \psi_p(x)\, y \, d\mu$
defines the unique semi-inner-product on
$X = L^p(\Omega, \mu)$.

2. Stable Processes. Let $S(x)$ be a symmetric
$\alpha$-stable random process. The span of
$(S(x))_{x \in X}$ is a vector subspace $B$ of the
space of all stable random variables, and can be
endowed with a norm: the spread $\sigma(S(x))$
(cf. [12]). This Banach space has an s.i.p.
representation as follows. Define the covariation
between $S(x_1)$ and $S(x_2)$ as

$$[S(x_1), S(x_2)] = \int_{S^1} s_1^{\langle \alpha - 1\rangle} s_2 \, d\Gamma \qquad (2)$$

where $s_1$ and $s_2$ are $(x, y)$ coordinates on the
unit circle, and $\Gamma$ the spectral measure for the
pair $(S(x_1), S(x_2))$. Then the (unique)
semi-inner-product on $B$ is defined by

$$\langle S_1, S_2\rangle = \frac{[S_1, S_2]}{\sigma_{S_1}^{\alpha - 2}} \qquad (3)$$

for $S_1, S_2 \in B$. The Gaussian case $\alpha = 2$
gives $\langle S_1, S_2\rangle = \frac{1}{2}\mathrm{Cov}(S_1, S_2)$.

One also has the peculiar representation, from a
formula of Cambanis [12]:

$$\langle S_1, S_2\rangle = \sigma_{S_1}^2 \, \frac{E\big(S_1^{\langle p-1\rangle} S_2\big)}{E|S_1|^p}, \qquad 1 < p < \alpha$$

$^1$ The notation $\langle x, y\rangle$ for the s.i.p.
must not be confused with the similar notation
$\langle x^*, y\rangle = x^*(y)$ sometimes employed for
the evaluation of a linear functional $x^*$ in the dual
$X^*$ at the point $y \in X$. See, however, the
generalized Riesz representation theorem of Theorem 2,
where the two notations become somewhat unified.

$^2$ We employ the notation for the signed power
function $a^{\langle b\rangle} = |a|^b \,\mathrm{sgn}(a)$
for $a \in \mathbb{R}$ and $b > 0$, and the natural
component-wise extension for $a$ a vector or function.
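As a rough numerical illustration of Example 1 and of formula (1), the sketch below evaluates the $L^p$ semi-inner-product on a finite sequence space (counting measure on three points, an assumption made only to keep the example small). The function names `signed_power`, `p_norm`, `sip` and `sip_by_limit` are hypothetical helpers, not part of the original text, and the central difference is merely one convenient way to approximate the limit in (1).

```python
import math

def signed_power(a, b):
    """Signed power a^<b> = |a|^b * sgn(a), for real a and b > 0."""
    return math.copysign(abs(a) ** b, a)

def p_norm(x, p):
    """l^p norm of a finite sequence x."""
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

def sip(x, y, p):
    """<x, y>_p = sum_k x_k^<p-1> y_k / ||x||_p^(p-2); linear in y only."""
    nx = p_norm(x, p)
    if nx == 0.0:
        return 0.0
    return sum(signed_power(a, p - 1) * b for a, b in zip(x, y)) / nx ** (p - 2)

def sip_by_limit(x, y, p, lam=1e-6):
    """Central-difference estimate of the limit in formula (1)."""
    plus = p_norm([a + lam * b for a, b in zip(x, y)], p) ** 2
    minus = p_norm([a - lam * b for a, b in zip(x, y)], p) ** 2
    return (plus - minus) / (4.0 * lam)

x = [1.0, -2.0, 0.5]
y = [0.3, 0.7, -1.1]
p = 3.0
print(sip(x, y, p), sip_by_limit(x, y, p))  # closed form vs. formula (1)
print(sip(x, x, p), p_norm(x, p) ** 2)      # norm-inducing property
print(sip(y, x, p))                         # generally differs from sip(x, y, p)
```

For $p = 2$ the same `sip` reduces to the ordinary dot product, which gives a quick sanity check of the normalisation by $\|x\|_p^{p-2}$.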


Semi-inner products induce a notion of orthogonality
in normed linear spaces often helpful for geometric
intuition: we define $x \perp y$ iff
$\langle x, y\rangle = 0$. Note that because of
asymmetry of the s.i.p., this notion of orthogonality
is not usually symmetric. Seen in this light, the Riesz
representation theorem of Theorem 2(i) is a
generalization of the observation that in
$\mathbb{R}^d$ a $(d-1)$-dimensional hyperplane passing
through the origin is parameterized by a given normal
vector $w$ (in the s.i.p. sense) to the plane. It is
not difficult to see from (1) that s.i.p. orthogonality
coincides with the following notion of
“minimum-distance” orthogonality in real normed linear
spaces introduced by R. James [7, 5]: $x \perp y$ iff
$\|x + \lambda y\| \ge \|x\|$ for all
$\lambda \in \mathbb{R}$. It follows that many problems
of best approximation in Banach spaces are naturally
formulated in terms of semi-inner-products.
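The asymmetry of s.i.p. orthogonality can be seen on a small example. The sketch below works in $\mathbb{R}^2$ with the 3-norm (an illustrative choice, not taken from the text) and checks the James condition only on a finite grid of $\lambda$ values, so it is a numerical hint rather than a proof; `sip`, `p_norm` and `james_orthogonal` are hypothetical helper names.

```python
import math

def p_norm(x, p):
    """l^p norm of a finite sequence x."""
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

def sip(x, y, p):
    """l^p semi-inner-product: nonlinear in x, linear in y."""
    nx = p_norm(x, p)
    if nx == 0.0:
        return 0.0
    num = sum(math.copysign(abs(a) ** (p - 1), a) * b for a, b in zip(x, y))
    return num / nx ** (p - 2)

def james_orthogonal(x, y, p):
    """Check ||x + lam*y|| >= ||x|| for lam on a grid in [-3, 3]."""
    base = p_norm(x, p)
    lams = [k / 100.0 for k in range(-300, 301)]
    return all(
        p_norm([a + lam * b for a, b in zip(x, y)], p) >= base - 1e-12
        for lam in lams
    )

x = [1.0, 2.0]
y = [4.0, -1.0]
p = 3.0
print(sip(x, y, p))               # zero: x is orthogonal to y
print(sip(y, x, p))               # non-zero: y is not orthogonal to x
print(james_orthogonal(x, y, p))  # James condition holds on the grid
print(james_orthogonal(y, x, p))  # James condition fails on the grid
```

Here $x^{\langle 2\rangle} = (1, 4)$, so $\langle x, y\rangle$ vanishes exactly, while $y^{\langle 2\rangle} = (16, -1)$ gives a strictly positive $\langle y, x\rangle$.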

3 Hard-Margin Binary Classification 
in Banach Spaces 

Our aim is to develop a semi-inner-product formulation
of the maximal-hard-margin linear classification
problem in Banach spaces. The advantage of such an
approach over other developments, such as [1, 13], is
that s.i.p. arguments emulating the Hilbert case become
available. This allows us to go considerably farther in
the development of a parallel theory. Moreover, the
s.i.p. economically and clearly mediates between two
equivalent formulations: one in the learning domain and
one in the dual space. We shall see that in the general
Banach space case, these two formulations are rather
different, whereas in the Hilbert case they coincide
since the dual of a Hilbert space is isometrically
isomorphic to itself (i.e. in some sense self-dual).

Henceforth let us assume that the learning domain $X$
is a uniformly smooth, uniformly convex Banach space
with s.i.p. $\langle\cdot,\cdot\rangle$.

Lemma 1. Given $w \in X$, let
$H = \{x \in X : \langle w, x\rangle + b = 0\}$ be a
hyperplane in $X$. Then the distance between $x_0$ and
$H$ is
$d = \inf_{x \in H} \|x_0 - x\| = \|w\|^{-1}\, |\langle w, x_0\rangle + b|$.



Proof. This is simply Theorem 1 of [5], recast in the
language of s.i.p.’s, via Theorem 2. ■
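Lemma 1 can be checked numerically in a small sequence space. The sketch below assumes the finite-dimensional $\ell^p$ semi-inner-product of Example 1 and uses the observation (ours, following from linearity in the second argument and the Cauchy-Schwarz property) that the point $x_0 - t\,w$ with $t = (\langle w, x_0\rangle + b)/\|w\|^2$ lies on $H$ at exactly the distance given by the lemma. All names and the sample data are hypothetical.

```python
import math
import random

def p_norm(x, p):
    """l^p norm of a finite sequence x."""
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

def sip(x, y, p):
    """l^p semi-inner-product: nonlinear in x, linear in y."""
    nx = p_norm(x, p)
    if nx == 0.0:
        return 0.0
    num = sum(math.copysign(abs(a) ** (p - 1), a) * b for a, b in zip(x, y))
    return num / nx ** (p - 2)

def distance_to_hyperplane(w, b, x0, p):
    """Lemma 1: distance from x0 to H = {x : <w, x> + b = 0}."""
    return abs(sip(w, x0, p) + b) / p_norm(w, p)

def foot_point(w, b, x0, p):
    """Move x0 along w until it lands on H (uses linearity in the 2nd slot)."""
    t = (sip(w, x0, p) + b) / p_norm(w, p) ** 2
    return [a - t * c for a, c in zip(x0, w)]

w, b, x0, p = [1.0, -2.0, 0.5], 0.3, [2.0, 1.0, -1.0], 4.0
d = distance_to_hyperplane(w, b, x0, p)
x_star = foot_point(w, b, x0, p)
gap = p_norm([a - c for a, c in zip(x0, x_star)], p)

# Other points of H (random points moved onto H) are never closer than d.
rng = random.Random(0)
others = [foot_point(w, b, [rng.uniform(-5, 5) for _ in w], p) for _ in range(200)]
closest_other = min(p_norm([a - c for a, c in zip(x0, h)], p) for h in others)
print(d, gap, closest_other)
```

The random sample only probes the lower bound; the bound itself follows from $|\langle w, x_0 - x\rangle| \le \|w\|\,\|x_0 - x\|$ for every $x \in H$.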

Now let training points $\{x_i, y_i\}_{i=1}^m$,
$x_i \in X$, be given, where $y_i = \pm 1$. If the data
are linearly separable, then there exists a
(continuous) linear functional $f(x)$ and an offset
$b \in \mathbb{R}$ such that $y_i(f(x_i) + b) > 0$, for
all $i$. By Theorem 2, there exists a vector $w \in X$
such that $f(x) = \langle w, x\rangle$. By rescaling
$w$ and $b$, using homogeneity of the s.i.p., and
Lemma 1, we may assume without loss of generality that
the point(s) closest to the hyperplane
$\langle w, x\rangle + b = 0$ satisfy
$|\langle w, x_i\rangle + b| = 1$. Thus $H$ may be
placed in the canonical form $H = (w, b)$, with
$y_i(\langle w, x_i\rangle + b) \ge 1$, for all $i$.
With this form, it is also now immediate from Lemma 1
that the margin of the hyperplane is $\|w\|^{-1}$. We
have then derived:

Data Domain Optimization for Maximum-Margin Banach
Linear Classifier

$$\inf_{w \in X,\, b \in \mathbb{R}} \|w\|_X \quad \text{s.t.} \quad y_i(\langle w, x_i\rangle_X + b) \ge 1 \qquad (5)$$

The classifier is given by
$f(x) = \mathrm{sgn}(\langle w, x\rangle + b)$.

This of course is the usual hard-margin formulation in
Hilbert spaces, with the inner product replaced by the
semi-inner-product. Posing the problem in the dual
space, through Theorem 2, we have

Dual Domain Optimization for Maximum-Margin Banach
Linear Classifier

$$\inf_{w^* \in X^*,\, b \in \mathbb{R}} \|w^*\|_{X^*} \quad \text{s.t.} \quad y_i(\langle x_i^*, w^*\rangle_{X^*} + b) \ge 1 \qquad (6)$$


It is instructive to compare the two problems (5)
and (6). The key difference lies in the nature of the
constraints: since the semi-inner product is linear in
the second variable but generally non-linear in the
first, one sees that the data-domain formulation gives
rise to an optimization problem non-linear in its
constraints, and in general non-convex, whereas the
dual-form problem (6) gives rise to a convex
optimization with linear constraints. Put another way,
there exists an appropriate duality mapping (change of
variables) of the non-convex problem (5) to the convex
problem (6).


3.1 Existence and Uniqueness of the Solution

In general, there is no guarantee that the minimizer
to the program (6) is unique: indeed it may not even
exist. Simple counterexamples can be found in $L^1$,
and the spaces of continuous functions, for instance.
However, the imposition of uniform convexity does make
the problem well-posed. Without loss of generality,
assume $b = 0$. Define the sets
$S_i = \{w^* : y_i \langle x_i^*, w^*\rangle_{X^*} \ge 1\}$;
these are closed, convex subsets of $X^*$ for each $i$.
Problem (6) can now be viewed as the task of finding
the point in $\bigcap_i S_i$ closest to 0. As alluded
to above, it is a standard fact from elementary
functional analysis (cf. [8]) that, given a point $z$
in a uniformly convex Banach space $X$, and a closed
convex subset $C$ of $X$, there exists a unique point
$c \in C$ such that
$\|z - c\| = \inf_{c' \in C} \|z - c'\|$. This fact
immediately produces

Theorem 3. The solution to (5) and (6) exists and
is unique, for uniformly smooth and uniformly convex
Banach spaces $X$.

3.2 Form of the solution: A Linear
Representer Theorem

The optimization problem (6) is posed, in general,
in an infinite-dimensional space. However, since the
number of data points is finite, one intuitively
expects that the optimal solution should depend only on
the metric relations between the data points; in other
words, that one need only search in the space of
functionals on the finite-dimensional space spanned by
the data. Another way to put this is that the problem
should not depend on the ambient space in which the
data is embedded. Such a theorem, in the Hilbert-space
case, is known as the Representer Theorem. We shall
prove this result now for uniformly smooth and
uniformly convex Banach spaces.

The proof is constructed in two steps. The first, of
interest in its own right, is the establishment of the
necessity of a KKT-like condition for optimization
problems with affine constraints in the generality of
reflexive Banach spaces. The second step involves the
computation of the associated Fréchet derivatives and
an application of the semi-inner-product formalism to
derive the required hyperplane representation.

For the moment, consider the more general setting of
a reflexive Banach space $B$, and a differentiable cost
function $f : B \to \mathbb{R}$. Suppose a solution to
the linearly constrained problem

$$\min f(x) \quad \text{s.t.} \quad b_i + g_i(x) \ge 0, \ \forall i = 1, \ldots, n \qquad (7)$$

is sought, where $g_i$ are continuous linear
functionals in the dual $B^*$, and
$b_i \in \mathbb{R}$ shifting constants. Denoting by
$D_x f \in B^*$ the derivative of $f$ at $x$, we have:

Theorem 4. Any local minimum $x^* \in B$ to the
optimization problem (7) satisfies
$D_{x^*} f = \sum_{i=1}^n \lambda_i g_i$, for some
$\lambda_i \ge 0$.



Proof. Critical to the proof is an application of a
separating hyperplane theorem in the Banach space. Many
versions exist; one which more than suffices for our
purposes is:

Theorem 5. [3] Let $A$ and $B$ be two disjoint
nonempty convex sets in a real topological vector space
$X$, with $A$ open. Then there is a continuous linear
functional $f^* \in X^*$, and $\alpha \in \mathbb{R}$
such that $f^*(a) < \alpha \le f^*(b)$,
$\forall a \in A, \forall b \in B$.

Now we proceed via contradiction. Suppose that
$D_{x^*} f \notin C$, where $C$ is the convex cone
spanned by $\{g_1, \ldots, g_n\}$. Since $C$ is closed
in $B^*$, there exists an open ball about $D_{x^*} f$
not intersecting $C$; applying the separation theorem
then gives an element $s^{**} \in B^{**}$, and a real
$\alpha$ such that $s^{**}(g_i) - \alpha \ge 0$ for all
$i$, and $s^{**}(D_{x^*} f) - \alpha < 0$. Since
$s^{**}(0) = 0$, $\alpha \le 0$;
$s^{**}(C) \ge \alpha$ then implies
$s^{**}(g_i) \ge 0$ and $s^{**}(D_{x^*} f) < 0$.
Reflexivity of the Banach space implies there exists an
$s \in B$ such that $s^{**}(x^*) = x^*(s)$, for all
$x^*$, hence we have found an $s \in B$ satisfying
$g_i(s) \ge 0$ for all $i$ and $(D_{x^*} f)(s) < 0$.

Let $\delta > 0$, and consider the point
$x^* + \delta s$. Since
$b_i + g_i(x^* + \delta s) \ge 0$, $x^* + \delta s$ is
a feasible point. The differentiability of $f$ implies

$$f(x^* + \delta s) - f(x^*) = (D_{x^*} f)(\delta s) + o(\|\delta s\|) \qquad (8)$$
$$= \delta \cdot (D_{x^*} f)(s) + o(|\delta|) \qquad (9)$$

Thus there exists $\delta'$ sufficiently small such
that for all $0 < \delta < \delta'$,
$f(x^* + \delta s) - f(x^*) < 0$, contradicting the
assumption that $x^*$ is a local minimum. ■

Return now to (6), and let $w^* \in X^*$ be its unique
global minimizer. Let $w_* \in B$ be such that
$\langle w_*, \cdot\rangle_X = w^*$ (it exists uniquely
by Theorem 2).

Theorem 6. (A Representer Theorem). The
maximum-margin separating hyperplane solving (5)
admits the expansion

$$w_* = \sum_{i=1}^m \alpha_i x_i \qquad (10)$$

for some $\alpha_i \in \mathbb{R}$.

Proof. Uniform smoothness and convexity of $B$ implies
the same for its dual $B^*$. Let
$f(w^*) = \|w^*\|^2$, differentiable because the space
is smooth. Theorem 1 shows
$2\langle w^*, a^*\rangle_{X^*} = (D_{w^*} f)(a^*)$.
Every uniformly convex Banach space is reflexive;
Theorem 4 then states there exists $\lambda_i \ge 0$
such that

$$2\langle w^*, a^*\rangle_{X^*} - \sum_{i=1}^m \lambda_i y_i \langle x_i^*, a^*\rangle_{X^*} = 0 \qquad (11)$$
$$\Big\langle a,\; w_* - \sum_{i=1}^m \tfrac{1}{2}\lambda_i y_i x_i \Big\rangle_X = 0 \qquad (12)$$

The last line holds for every $a \in B$, hence in
particular for
$a = w_* - \sum_{i=1}^m \frac{1}{2}\lambda_i y_i x_i$;
this gives
$\|w_* - \sum_{i=1}^m \frac{1}{2}\lambda_i y_i x_i\|^2 = 0$,
the sought-after result with
$\alpha_i = \frac{1}{2}\lambda_i y_i$. ■

For the special case of $X = L^p(\Omega, \mu)$ spaces,
we have established:

Corollary 1. (Representer Theorem for $L^p$
Classifier) The maximum-margin hyperplane
$w^* \in X^*$ admits the expansion (with equality in
the $L^q$ sense)

$$w^* = \Big(\sum_{i=1}^m \alpha_i x_i\Big)^{\langle p-1\rangle} \qquad (13)$$

Equivalently, if $w_* \in X$ is the unique representer
for $w^*$,

$$w_* = \psi_q(w^*) = C \sum_{i=1}^m \alpha_i x_i \qquad (14)$$

for a real constant $C$; equality holding in the $L^p$
sense. (Here the $\alpha_i$ are the coefficients of
Theorem 6 up to a common positive rescaling, which
absorbs the normalising factor in $\psi_p$.)

3.3 Lagrange Dual S.i.p. Formulation 

Direct substitution of the representation of Theorem 6
into (5) gives a finite-dimensional optimization
problem, but one with non-convex constraints. Instead,
we use Theorem 6 to assume that, without loss of
generality, $X = \mathrm{span}\{x_1, \ldots, x_m\}$ and
apply standard finite-dimensional convex optimization
theory to the dual-space problem (6).

Form the Lagrange function to an equivalent convex
problem of (6):

$$L(\lambda, w^*) = \frac{1}{2}\|w^*\|_{X^*}^2 + \sum_{i=1}^m \lambda_i \big(1 - y_i(b + \langle x_i^*, w^*\rangle_{X^*})\big)$$

Fixing $\lambda_i \ge 0$,
$\inf_{w^* \in X^*} L(\lambda, w^*)$ is a convex
problem with differentiable cost. It achieves its
unique minimum when $\partial_{w^*} L = 0$ and
$\partial_b L = 0$. The former has actually already
been computed in (11), and implies
$w = \sum_{i=1}^m \lambda_i y_i x_i$; the latter
derivative implies $\sum_{i=1}^m \lambda_i y_i = 0$.
The Lagrange dual function, using the Riesz Theorem to
revert back to data variables, becomes

$$L(\lambda) = \frac{1}{2}\|w\|_X^2 + \sum_{i=1}^m \lambda_i \big(1 - y_i \langle w, x_i\rangle_X\big) \qquad (15)$$
$$= \frac{1}{2}\|w\|_X^2 - \Big\langle w, \sum_{i=1}^m \lambda_i y_i x_i \Big\rangle_X + \sum_{i=1}^m \lambda_i \qquad (16)$$
$$= -\frac{1}{2}\Big\|\sum_{i=1}^m \lambda_i y_i x_i\Big\|_X^2 + \sum_{i=1}^m \lambda_i \qquad (17)$$

The convex dual optimization problem now has the
structure:

Lagrange-Dual Optimization for Maximum-Margin Banach
Linear Classifier

$$\max_{\lambda} \; -\frac{1}{2}\Big\|\sum_{i=1}^m \lambda_i y_i x_i\Big\|_X^2 + \sum_{i=1}^m \lambda_i \quad \text{s.t.} \quad \lambda_i \ge 0, \ \sum_{i=1}^m \lambda_i y_i = 0 \qquad (18)$$

Since the primal dual-space problem (6) is convex with
affine constraints, strong duality is achieved through
the Lagrange dual, by standard theorems in
finite-dimensional convex analysis [2]. A solution to
(18) then gives the solution also to the problem (5),
with the large-margin classifier attaining the final
form
$f(x) = \mathrm{sgn}\big(\langle \sum_{i=1}^m \lambda_i y_i x_i, x\rangle_X + b\big)$.
The offset may be computed via the standard KKT
condition that the product of dual variables and
constraints must vanish [2], i.e. for any $i$ for which
$\lambda_i > 0$,
$b = 1/y_i - \langle \sum_{j=1}^m \lambda_j y_j x_j, x_i\rangle_X$.
As in the Hilbert case, certain identities hold true:
e.g. by summing over the aforementioned condition, the
property
$\sum_i \lambda_i = \|w\|^2 = 1/(\text{margin})^2$
holds for the optimal hyperplane.
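A worked special case may help fix ideas. With only two training points, one per class, the constraint $\sum_i \lambda_i y_i = 0$ forces $\lambda_1 = \lambda_2 = \lambda$, and the objective of (18) reduces to $-\frac{1}{2}\lambda^2 D^2 + 2\lambda$ with $D = \|x_1 - x_2\|$, maximised at $\lambda = 2/D^2$. The sketch below evaluates this closed form in a finite-dimensional $\ell^p$ space; the two-point setting, the helper names and the sample data are our own illustrative assumptions, not part of the original development.

```python
import math

def p_norm(x, p):
    """l^p norm of a finite sequence x."""
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

def sip(x, y, p):
    """l^p semi-inner-product: nonlinear in x, linear in y."""
    nx = p_norm(x, p)
    if nx == 0.0:
        return 0.0
    num = sum(math.copysign(abs(a) ** (p - 1), a) * b for a, b in zip(x, y))
    return num / nx ** (p - 2)

def two_point_classifier(x_pos, x_neg, p):
    """Closed-form solution of (18) for one point per class."""
    diff = [a - c for a, c in zip(x_pos, x_neg)]
    lam = 2.0 / p_norm(diff, p) ** 2   # maximiser of -0.5*lam^2*D^2 + 2*lam
    w = [lam * t for t in diff]        # w = sum_i lam_i y_i x_i
    b = 1.0 - sip(w, x_pos, p)         # KKT offset from the positive point
    return w, b, lam

def classify(w, b, x, p):
    """Sign of <w, x> + b."""
    return 1 if sip(w, x, p) + b >= 0 else -1

x_pos, x_neg, p = [2.0, 1.0], [0.0, -1.0], 3.0
w, b, lam = two_point_classifier(x_pos, x_neg, p)
print(sip(w, x_pos, p) + b, sip(w, x_neg, p) + b)  # both constraints are active
print(2.0 * lam, p_norm(w, p) ** 2)                # sum of lambdas vs ||w||^2
print(1.0 / p_norm(w, p))                          # margin = D / 2
```

Both constraints are active at the optimum, and the margin $1/\|w\|$ equals half the $p$-norm distance between the two points, as one would expect geometrically.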

One observes then, that the usual SVM Lagrange dual
optimization for Hilbert spaces generalizes naturally
and directly to the Banach space case; the crucial
difference being that the resulting classifier inhabits
the structure of a semi-inner-product rather than an
inner product, and hence exhibits a non-linear
dependence with respect to the dual coefficients
$\lambda$. Figure 1 displays a simple configuration of
three labelled points, and the resulting large-margin
classifiers computed with respect to the $p$-norms
$p = 1.5, 2, 3$ and $4$.



Figure 1: Maximum-margin separating hyperplanes in
$(\mathbb{R}^2, \|\cdot\|_p)$.


3.4 Non-Linear Classifiers in $L^p$, $p \in 2\mathbb{Z}^+$

One of the most arresting ideas in standard SVM theory
consists in the kernel trick: the procedure where
inner-products in a learning domain $X$ are replaced by
bivariate functions $k(x, y)$ having the effect of
implicitly mapping a problem into a (usually)
higher-dimensional Hilbert space $\mathcal{H}$, through
a feature map $\Phi : X \to \mathcal{H}$. The “trick”
consists in selecting a kernel whose calculation does
not involve explicit knowledge of the map $\Phi$, nor
an inner-product evaluation in $\mathcal{H}$. In this
way, classification may be performed in the
high-dimensional, even infinite-dimensional feature
space $\mathcal{H}$ without incurring the expected
additional cost of a dimensionality increase.

Having developed an s.i.p. formalism for the Banach
space binary classification problem, we are led
immediately to the question of whether a similar
“kernel trick” is available for semi-inner-products. An
initial idea is, in emulation of the Hilbert
counterparts, to define bivariate s.i.p. kernels. Two
crucial aspects, however, become profound obstacles to
the establishment of a similar “kernel” theory for
Banach spaces along this route: (1) lack of bilinearity
of the s.i.p. prevents the classifier
$\langle \sum_{i=1}^m \lambda_i y_i x_i, x\rangle$ from
being written as a function of s.i.p.’s between data
points and test points, and (2) the s.i.p. does not
prescribe the structure of a Banach space in as total a
way as an inner product; for example, an inner product
defined on a vector basis extends uniquely to the whole
space: this is false for s.i.p.’s.


While there may thus appear to be little hope of
establishing a general kernel theory for Banach spaces,
nevertheless there is one special class of Banach
spaces which appears highly amenable to a type of
kernel theory: the $L^p$ spaces for even integer $p$.
We shall see that a certain type of $p$-variate
multi-linear moment function can take the place of the
bivariate kernel function for Hilbert spaces; this
theory, which we begin to delineate here, coupled with
the linear theory of the previous section, combines to
give classification tools as powerful as the ones for
Hilbert spaces.

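Reconsider the $X = L^p(\Omega, \mu)$ spaces, for even
integers $p$. It will be more convenient to use the
following equivalent optimization program to (18) for
these spaces:

$$\max_{\lambda \in \mathbb{R}^m} \; -\frac{1}{p}\int_\Omega \Big(\sum_{i=1}^m \lambda_i y_i x_i\Big)^p d\mu + \sum_{i=1}^m \lambda_i \quad \text{s.t.} \quad \lambda_i \ge 0, \ \sum_{i=1}^m \lambda_i y_i = 0 \qquad (19)$$

with the classifier now having type
$f(x) = \mathrm{sgn}\big(b + \int_\Omega (\sum_{i=1}^m \lambda_i y_i x_i)^{p-1} x \, d\mu\big)$.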

Now we apply the non-linear extension. Let the
data-points $x_i$ belong to an abstract set
$\mathcal{M}$ and pre-process via a map
$\Phi : \mathcal{M} \to X$ from the data-domain into an
$L^p$ feature space. The quantities
$\int_\Omega (\sum_{i=1}^m \lambda_i y_i \Phi(x_i))^p \, d\mu$
and
$\int_\Omega (\sum_{i=1}^m \lambda_i y_i \Phi(x_i))^{p-1} \Phi(x) \, d\mu$
may be respectively written as $p$-th and $(p-1)$-th
order polynomials in $\lambda_i y_i$, with coefficients
of the form
$\int_\Omega \Phi(x_{i_1}) \cdots \Phi(x_{i_p}) \, d\mu$,
i.e. $p$-th order moment functions in the feature
space. The optimization may then proceed without
explicit knowledge of $\Phi$, but simply via moment
functions
$M(x_1, \ldots, x_p) = \int_\Omega \Phi(x_1) \cdots \Phi(x_p) \, d\mu$.
The choice of different moment functions implicitly
provides a selection of non-linear map into an $L^p$
space, in absolute analogy to the $L^2$ case,
restricting the search to a small hypothesis space of
possible non-linear classifiers induced by $M$. The
even-order $L^p$ non-linear classifier, given a moment
function $M(x_1, \ldots, x_p)$, is then the solution of
the convex program:

Optimization for Maximum-Margin $L^p$ Moment
Classifier

$$\max_{\lambda} \; -\frac{1}{p} \sum_{(i_1, \ldots, i_p) = 1}^m \lambda_{i_1} y_{i_1} \cdots \lambda_{i_p} y_{i_p} M(x_{i_1}, \ldots, x_{i_p}) + \sum_{i=1}^m \lambda_i \quad \text{s.t.} \quad \lambda_i \ge 0, \ \sum_{i=1}^m \lambda_i y_i = 0 \qquad (20)$$

resulting in the non-linear classifier
$f(x) = \mathrm{sgn}\big(b + \sum_{(i_1, \ldots, i_{p-1}) = 1}^m \lambda_{i_1} y_{i_1} \cdots \lambda_{i_{p-1}} y_{i_{p-1}} M(x_{i_1}, \ldots, x_{i_{p-1}}, x)\big)$.
The offset $b$ is once more computed with
$b = 1/y_k - \sum_{(i_1, \ldots, i_{p-1}) = 1}^m \lambda_{i_1} y_{i_1} \cdots \lambda_{i_{p-1}} y_{i_{p-1}} M(x_{i_1}, \ldots, x_{i_{p-1}}, x_k)$
for any $k$ satisfying $\lambda_k > 0$.
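The step from the integral form (19) to the moment form (20) is just multilinear expansion of a $p$-th power. The sketch below checks this on a toy feature space, taking $\Omega$ to be three points with weights playing the role of $\mu$; the feature vectors, weights and function names are all hypothetical and chosen only so the identity can be verified by direct summation.

```python
import itertools

def moment(feature_values, weights):
    """M(x_1..x_p) = sum over omega of mu(omega) * prod_j Phi(x_j)(omega)."""
    total = 0.0
    for k, mu in enumerate(weights):
        prod = mu
        for phi in feature_values:
            prod *= phi[k]
        total += prod
    return total

def direct_power_integral(coeffs, feats, weights, p):
    """Integral of (sum_i c_i Phi(x_i))^p d mu, computed pointwise."""
    total = 0.0
    for k, mu in enumerate(weights):
        s = sum(c * phi[k] for c, phi in zip(coeffs, feats))
        total += mu * s ** p
    return total

def power_integral_via_moments(coeffs, feats, weights, p):
    """Same quantity as a degree-p polynomial in c with moment coefficients."""
    total = 0.0
    for idx in itertools.product(range(len(coeffs)), repeat=p):
        c = 1.0
        for i in idx:
            c *= coeffs[i]
        total += c * moment([feats[i] for i in idx], weights)
    return total

weights = [0.2, 0.5, 0.3]          # measure mu on a three-point Omega
feats = [[1.0, -0.5, 2.0],         # Phi(x_1) as a function on Omega
         [0.3, 1.2, -1.0],         # Phi(x_2)
         [-0.7, 0.4, 0.9]]         # Phi(x_3)
coeffs = [0.8, -0.5, 0.3]          # stands in for lambda_i * y_i
p = 4
print(direct_power_integral(coeffs, feats, weights, p))
print(power_integral_via_moments(coeffs, feats, weights, p))
```

The second routine touches the features only through `moment`, which is the point of the construction: once $M$ is available in closed form, $\Phi$ never needs to be evaluated. The naive sum has $m^p$ terms, so in practice one would exploit the symmetry of $M$.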

3.5 Construction of Moment Functions 

To specify a moment function $M$ in an $L^p$ space is
to give not only its semi-inner product
$\langle x, y\rangle$, but in fact the general $p$-th
order statistic. Moment functions therefore contain
significantly more information than the s.i.p.
representation of the Banach space, specifying the
$L^p$ structure in a way similar to kernel functions
for $L^2$. Such moment functions will satisfy certain
properties: multi-linearity in the feature space,
exchangeability in its $p$ variables, positivity
($M(x, \ldots, x) \ge 0$), as well as various
Hölder-like inequalities; to illustrate in the case
$p = 4$:
$M^4(x, x, x, y) \le M^3(x, x, x, x)\, M(y, y, y, y)$
and
$M^4(x, x, y, y) \le M^2(x, x, x, x)\, M^2(y, y, y, y)$.

A rich and general source of feature spaces consists
of spaces of random variables; as a familiar example,
Gaussian feature maps $\Phi$. Let $k(x, y)$ be a
positive-definite kernel on the data-domain
$\mathcal{M} = \mathbb{R}^d$, and $G(x)$ the associated
zero-mean Gaussian random process on $\mathbb{R}^d$
with covariance $k$; define the feature map $\Phi$ by
$x \to G(x)$. The higher-order moments in this case are
easily calculable in terms of the second-order
structure:

$$M(x_1, \ldots, x_p) = E(G(x_1) \cdots G(x_p)) = \sum_{\pi} k(x_{\pi(1)}, x_{\pi(2)}) \cdots k(x_{\pi(p-1)}, x_{\pi(p)}) \qquad (21)$$

over permutations $\pi \in S_p$ giving distinct
pairings, for a total of
$(p-1)!/(2^{p/2-1}(p/2-1)!)$ terms, each a product of
$p/2$ kernel terms. For example, $p = 4$ gives
$M(x_1, \ldots, x_4) = k(x_1, x_2)k(x_3, x_4) + k(x_1, x_3)k(x_2, x_4) + k(x_1, x_4)k(x_2, x_3)$.
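The sum in (21) runs over the ways of splitting the $p$ arguments into unordered pairs. A minimal sketch of that enumeration follows; the recursive generator `pairings`, the function `gaussian_moment` and the RBF kernel on scalars are illustrative choices of ours, not constructions given in the text.

```python
import math

def pairings(items):
    """Yield every way of splitting `items` into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for j in range(len(rest)):
        remaining = rest[:j] + rest[j + 1:]
        for tail in pairings(remaining):
            yield [(first, rest[j])] + tail

def gaussian_moment(points, k):
    """E[G(x_1)...G(x_p)] for a zero-mean Gaussian process with covariance k."""
    total = 0.0
    for pairing in pairings(list(range(len(points)))):
        term = 1.0
        for a, b in pairing:
            term *= k(points[a], points[b])
        total += term
    return total

def rbf(s, t):
    """Gaussian (RBF) kernel on the real line, used only as an example."""
    return math.exp(-(s - t) ** 2)

pts = [0.0, 0.5, 1.0, 2.0]
print(sum(1 for _ in pairings(list(range(4)))))   # number of terms for p = 4
print(sum(1 for _ in pairings(list(range(6)))))   # number of terms for p = 6
print(gaussian_moment(pts, rbf))
```

The term counts agree with the formula $(p-1)!/(2^{p/2-1}(p/2-1)!)$ quoted above, and for an odd number of arguments the generator yields nothing, matching the vanishing odd moments of a zero-mean Gaussian.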

Observe $p = 2$ gives the usual SVM classifier with
kernel $k$. The interpretation for $p > 2$ is that one
uses the same feature map $\Phi$ into the space of
Gaussian random variables, but with the geometry in
that space induced by the $p$-norm rather than the
2-norm. In general, one class of feature spaces arises
from the probability measure $P$ of a random process
$\Phi(x)$ on $\mathbb{R}^d$; the moment functions $M$
are then functions of the measure $P$. However, by
using statistics of order higher than 2, we allow
dependency structures not available to Gaussian
processes. In the Hilbert-space case, every
inner-product kernel can be achieved by a Gaussian
feature map; this follows from the spectral theorem.
Not so with the non-linear Banach classifiers. Indeed,
one may imagine different feature mappings $\Phi_1$ and
$\Phi_2$ into spaces of random variables sharing the
same second-order kernel structure, but with different
$p$-order statistics; here the Hilbert classifiers
agree, the Banach classifiers differ.

Many other moment functions can be generated from
kernel functions by using feature maps of the type
$\Phi : x \to f(G(x))$, for
$f : \mathbb{R} \to \mathbb{R}$. For example, with
$f = \exp(\cdot)$ one obtains a log-normal random
process parameterized by kernel $k$. Using
characteristic functions, trite calculations show that

$$M(x_1, \ldots, x_p) = \exp\Big(\frac{1}{2}\sum_{i,j=1}^p k(x_i, x_j)\Big)$$

is an admissible moment function for any kernel $k$.
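For the reader's convenience, the calculation behind the log-normal case can be sketched as follows, assuming (as in the Gaussian example) that the feature space is a space of random variables so that the moment function is an expectation, and using the standard identity $E[e^{Z}] = e^{\mathrm{Var}(Z)/2}$ for a zero-mean Gaussian $Z$:

```latex
\begin{align*}
M(x_1,\dots,x_p)
  &= E\Big[\prod_{i=1}^p e^{G(x_i)}\Big]
   = E\big[e^{Z}\big], \qquad Z = \sum_{i=1}^p G(x_i),\\
\operatorname{Var}(Z)
  &= \sum_{i=1}^p \sum_{j=1}^p k(x_i, x_j),\\
E\big[e^{Z}\big]
  &= \exp\!\Big(\tfrac12 \operatorname{Var}(Z)\Big)
   = \exp\!\Big(\tfrac12 \sum_{i,j=1}^p k(x_i, x_j)\Big).
\end{align*}
```

The double sum includes the diagonal terms $k(x_i, x_i)$, so repeated arguments are handled automatically.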

Other easy facts concerning the combination properties
of moment functions can be derived with basic
probability, assuming finite-measure feature spaces. We
shall include only a brief listing here of the
uncountable variations. Let $\Phi_1$ and $\Phi_2$ be
two independent $L^p$ random processes on
$\mathbb{R}^d$ with moment functions $M_1$ and $M_2$.
Then the product $\Phi_1 \cdot \Phi_2$ defines a
feature map with moment function $M = M_1 \cdot M_2$,
and is hence admissible. If $\Phi$ is the map taking
each $x$ to a constant
$m(x) = m_x : \Omega \to \mathbb{R}$, then
$M(x_1, \ldots, x_p) = \prod_{i=1}^p m(x_i)$ is an
admissible moment function. Let $M_1$ and $M_2$ be two
Gaussian 4-th order moment functions generated from
kernels $k_1$ and $k_2$. Then

$$M_1(x,y,z,w) + k_2(x,z)k_1(y,w) + k_2(y,z)k_1(x,w) + k_2(x,y)k_1(z,w) + k_2(w,z)k_1(x,y) + k_2(w,x)k_1(y,z) + k_2(y,w)k_1(x,z) + M_2(x,y,z,w)$$

is admissible. This last result is one 4-th order
generalization of the second-order fact that kernels
form a cone.
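One way to read the last expression (our interpretation, offered as a sanity check rather than as the authors' derivation) is as the Gaussian 4-th order moment function of the summed kernel $k_1 + k_2$, i.e. of the sum of two independent Gaussian processes. The sketch below compares the eight-term expression against the three-pairing formula applied to $k_1 + k_2$; the two kernels on the real line are arbitrary examples.

```python
import math

def gaussian_moment4(k, x, y, z, w):
    """Three-pairing Gaussian moment for p = 4 with covariance kernel k."""
    return k(x, y) * k(z, w) + k(x, z) * k(y, w) + k(x, w) * k(y, z)

def combined_moment4(k1, k2, x, y, z, w):
    """M_1 + six mixed kernel products + M_2, as listed in the text."""
    cross = (k2(x, z) * k1(y, w) + k2(y, z) * k1(x, w) + k2(x, y) * k1(z, w)
             + k2(w, z) * k1(x, y) + k2(w, x) * k1(y, z) + k2(y, w) * k1(x, z))
    return gaussian_moment4(k1, x, y, z, w) + cross + gaussian_moment4(k2, x, y, z, w)

def k1(s, t):
    """Gaussian (RBF) kernel on the real line (example choice)."""
    return math.exp(-(s - t) ** 2)

def k2(s, t):
    """Inhomogeneous linear kernel on the real line (example choice)."""
    return 1.0 + s * t

def k_sum(s, t):
    """Sum of the two kernels, itself a positive-definite kernel."""
    return k1(s, t) + k2(s, t)

args = (0.2, -1.0, 0.7, 1.5)
print(combined_moment4(k1, k2, *args))
print(gaussian_moment4(k_sum, *args))
```

The agreement relies only on the symmetry of the kernels and on expanding each product $(k_1 + k_2)(\cdot,\cdot)\,(k_1 + k_2)(\cdot,\cdot)$ in the three pairings.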

4 Discussion 

We have developed a semi-inner-product formulation
of the binary classification problem in Banach spaces.
The main message might be said to be that all of
Hilbert linear classification theory carries over
neatly to the case of Banach spaces (at least
well-behaved ones). The resulting optimization programs
are no longer quadratic, but remain convex. Finally,
even kernel theory has its appropriate generalization
in the spaces $L^{2p}$, where moment functions replace
kernel functions. In our presentation of this
non-linear classification framework, we have made no
claims to having constructed the complete story, but
merely illustrated by way of calculations that an
analogous theory and practice to the Hilbert case
exists, and should be further studied. Certain apposite
questions still beg to be answered. What
characterizations exist for moment functions, in the
vein of Mercer’s theorem for kernels? Is there a way to
produce spectral decompositions of the moment tensors?
Is there a corresponding notion of Reproducing Kernel
Banach Space (RKBS), and, if so, how does one construct
an RKBS given an s.i.p. (or stronger, a moment
function)? What generalization error bounds can be
established for Banach classifiers, and how does one
choose the moment functions relative to the data? Can
non-linear classification be extended to general $L^p$
spaces, for fractional values of $p$, for instance, by
approximating the semi-inner product with polynomials
of moment functions?

These and similar points will be addressed in future
work.